Detecting Rare and Weak Spikes in Large Covariance Matrices
Abstract
Given p-dimensional Gaussian vectors Xi ∼iid N(0, Σ), 1 ≤ i ≤ n, where p ≥ n, we are interested in testing a null hypothesis where Σ = Ip against an alternative hypothesis where all eigenvalues of Σ are 1, except that r of them are larger than 1 (i.e., spiked eigenvalues). We consider a Rare/Weak setting where the spikes are sparse (i.e., 1 ≪ r ≪ p) and individually weak (i.e., each spiked eigenvalue is only slightly larger than 1), and discover a phase transition: the two-dimensional phase space that calibrates the spike sparsity and strengths partitions into the Region of Impossibility and the Region of Possibility. In the Region of Impossibility, all tests are (asymptotically) powerless in separating the alternative from the null. In the Region of Possibility, there are tests that have (asymptotically) full power. We consider a CuSum test, a trace-based test, an eigenvalue-based Higher Criticism test, and a Tracy-Widom test [28], and show that the first two tests have asymptotically full power in the Region of Possibility. To use our results from a different angle, we derive new bounds for (a) empirical eigenvalues, and (b) cumulative sums of the empirical eigenvalues, both under the alternative hypothesis. Part (a) is related to those in [4, 33], but both the settings and results are different. The study requires careful analysis of the L1-distance of our testing problem and delicate Random Matrix Theory. Our technical devices include (a) a Gaussian proxy model, (b) Le Cam’s comparison of experiments, and (c) large deviation bounds on empirical eigenvalues.
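As a rough illustration of the Rare/Weak spiked model and of why a trace-based statistic can separate the two hypotheses, here is a minimal simulation sketch (the function names and the specific calibration of n, p, r, and the spike strength delta are illustrative choices, not the paper's exact tests or thresholds):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n, p, r, delta):
    """Draw n iid N(0, Sigma) rows, where Sigma = I_p except that r
    randomly placed eigenvalues equal 1 + delta (the spikes)."""
    eigs = np.ones(p)
    eigs[rng.choice(p, size=r, replace=False)] = 1.0 + delta
    # Sigma is diagonal here; rotating it would not change the spectrum.
    return rng.standard_normal((n, p)) * np.sqrt(eigs)

def trace_statistic(X):
    """Centered and scaled trace of the sample covariance S = X'X / n.
    Under the null, tr(S) has mean p and variance 2p/n."""
    n, p = X.shape
    tr_S = (X ** 2).sum() / n            # tr(X'X) / n without forming S
    return (tr_S - p) / np.sqrt(2.0 * p / n)

n, p = 200, 400
z_null = trace_statistic(sample_data(n, p, r=0, delta=0.0))
z_alt = trace_statistic(sample_data(n, p, r=40, delta=1.0))
print(z_null, z_alt)   # the spikes shift tr(S) by roughly r * delta
```

With these (illustrative) parameters the alternative shifts the statistic by about r·delta / sqrt(2p/n) = 20 standard deviations, so the test separates the hypotheses easily; in the Rare/Weak regime studied in the paper, r and delta are calibrated so that this separation is delicate.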
Similar Resources
Positive-Definite l1-Penalized Estimation of Large Covariance Matrices
The thresholding covariance estimator has nice asymptotic properties for estimating sparse large covariance matrices, but it often has negative eigenvalues when used in real data analysis. To fix this drawback of thresholding estimation, we develop a positive-definite l1-penalized covariance estimator for estimating sparse large covariance matrices. We derive an efficient alternating direction me...
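The thresholding estimator mentioned above can be sketched in a few lines; entrywise soft-thresholding of the off-diagonal entries produces a sparse estimate, but nothing in the operation enforces positive definiteness, which is the drawback the penalized estimator addresses (the population covariance, threshold level, and dimensions below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# A sparse population covariance: identity plus a few strong
# off-diagonal pairs (an illustrative choice).
p, n = 60, 40
Sigma = np.eye(p)
for j in range(0, 10, 2):
    Sigma[j, j + 1] = Sigma[j + 1, j] = 0.6

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = np.cov(X, rowvar=False)

def soft_threshold(S, lam):
    """Entrywise soft-thresholding of S, leaving the diagonal untouched."""
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

T = soft_threshold(S, lam=0.2)
# T is sparse and symmetric, but its smallest eigenvalue may be
# negative; thresholding does not preserve positive definiteness.
print(np.linalg.eigvalsh(T).min())
```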
Comparing large covariance matrices under weak conditions on the dependence structure and its application to gene clustering.
Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of t...
Testing High-dimensional Covariance Matrices, with Application to Detecting Schizophrenia Risk Genes.
Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however, statistical methods have been limited by the high-dimensional nature of this problem. In this paper, we construct a sparse-Lea...
Spectrum estimation for large dimensional covariance matrices using random matrix theory
Estimating the eigenvalues of a population covariance matrix from a sample covariance matrix is a problem of fundamental importance in multivariate statistics; the eigenvalues of covariance matrices play a key role in many widely used techniques, in particular in Principal Component Analysis (PCA). In many modern data analysis problems, statisticians are faced with large datasets where the sample si...